In [1]:

Data Loading and Preparation

Loading Data

In [5]:
  File "<ipython-input-5-4e75fe9b1062>", line 1
    uber_apr14= pd.read_csv('F:\Data Science projects\by_other\uber-pickups-in-new-york-city/uber-raw-data-apr14.csv',encoding='utf-8')
                           ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 33-36: truncated \uXXXX escape


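Because of this error, I am going to add r before the file path.

The error comes from the backslashes in the path. In a normal string literal, Python treats \u as the start of a Unicode escape, so the \uber-pickups... part of the path cannot be decoded. Putting r before the opening quote makes the path a raw string, in which backslashes are kept literally. Writing the whole path with forward slashes works as well.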
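A minimal sketch of the fix, reusing the path from the traceback (it will differ on your machine). The read_csv call is left as a comment because it needs the file on disk; PureWindowsPath is used here only to show that the raw string keeps its backslashes.

```python
from pathlib import PureWindowsPath

# Raw string: the r prefix stops Python from treating \u, \b, ... as escapes.
csv_path = r'F:\Data Science projects\by_other\uber-pickups-in-new-york-city/uber-raw-data-apr14.csv'

# The backslashes survive as literal characters, so the path splits cleanly.
file_name = PureWindowsPath(csv_path).name
print(file_name)

# With the file in place, the original call then works unchanged:
# uber_apr14 = pd.read_csv(csv_path, encoding='utf-8')
```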

In [6]:
Out[6]:
['uber-raw-data-apr14.csv',
 'uber-raw-data-aug14.csv',
 'uber-raw-data-janjune-15.csv',
 'uber-raw-data-jul14.csv',
 'uber-raw-data-jun14.csv',
 'uber-raw-data-may14.csv',
 'uber-raw-data-sep14.csv']
In [7]:
In [8]:
Out[8]:
['uber-raw-data-apr14.csv',
 'uber-raw-data-aug14.csv',
 'uber-raw-data-jul14.csv',
 'uber-raw-data-jun14.csv',
 'uber-raw-data-may14.csv',
 'uber-raw-data-sep14.csv']
In [9]:
In [10]:
Out[10]:
(4534327, 4)
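The outputs above are consistent with listing the data folder, leaving out the January-June 2015 file (its columns differ from the 2014 files), and concatenating what remains. A sketch under those assumptions: the helper name combine_2014 and the two stand-in rows are illustrative, and the temporary folder only exists so the example runs without the real dataset.

```python
import os
import tempfile

import pandas as pd


def combine_2014(folder):
    """Concatenate every monthly 2014 CSV in `folder` into one DataFrame."""
    files = sorted(os.listdir(folder))
    # The Jan-June 2015 file has a different schema, so leave it out.
    files = [f for f in files if f != 'uber-raw-data-janjune-15.csv']
    frames = [pd.read_csv(os.path.join(folder, f), encoding='utf-8') for f in files]
    return pd.concat(frames, ignore_index=True)


# Stand-in data so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    header = '"Date/Time","Lat","Lon","Base"\n'
    rows = {
        'uber-raw-data-apr14.csv': '"4/1/2014 0:11:00",40.769,-73.9549,"B02512"\n',
        'uber-raw-data-sep14.csv': '"9/1/2014 0:01:00",40.2201,-74.0021,"B02512"\n',
        'uber-raw-data-janjune-15.csv': 'ignored\n',
    }
    for name, row in rows.items():
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as fh:
            fh.write(header + row)
    combined = combine_2014(folder)

print(combined.shape)
```

Passing ignore_index=True gives the combined frame one continuous index instead of repeating each file's own row numbers.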

Data Preparation

Date/Time : The date and time of the Uber pickup
Lat : The latitude of the Uber pickup
Lon : The longitude of the Uber pickup
Base : The TLC base company code affiliated with the Uber pickup
Latitude is the distance north or south of the Equator, measured in degrees from -90 to 90. Lines of latitude run from side to side (west to east) around the globe.
Longitude is the distance east or west of the prime meridian, measured in degrees from -180 to 180. Lines of longitude run from top to bottom (north to south), from pole to pole.
Every location on earth has a global address. Because the address is in numbers, people can communicate about location no matter what language they speak. A global address is given as two numbers called coordinates: the location's latitude and its longitude ("Lat/Long"). New York City, for example, sits near latitude 40.7 and longitude -74.0, which matches the values in the data below.
In [11]:
In [12]:
Out[12]:
   Date/Time          Lat      Lon       Base
0  9/1/2014 0:01:00   40.2201  -74.0021  B02512
1  9/1/2014 0:01:00   40.7500  -74.0027  B02512
2  9/1/2014 0:03:00   40.7559  -73.9864  B02512
3  9/1/2014 0:06:00   40.7450  -73.9889  B02512
4  9/1/2014 0:11:00   40.8145  -73.9444  B02512
In [13]:
Out[13]:
(4534327, 4)
In [14]:
Out[14]:
Date/Time     object
Lat          float64
Lon          float64
Base          object
dtype: object
In [15]:
In [ ]:
In [16]:
Out[16]:
Date/Time    datetime64[ns]
Lat                 float64
Lon                 float64
Base                 object
dtype: object
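The change of Date/Time from object to a datetime dtype between the two outputs above is what a pd.to_datetime conversion produces. A small sketch, assuming the month/day/year layout seen in the raw rows; the explicit format string is my addition, not something the notebook shows.

```python
import pandas as pd

# Two raw timestamps in the same layout as the CSV files.
df = pd.DataFrame({'Date/Time': ['9/1/2014 0:01:00', '9/1/2014 0:11:00']})

# An explicit format avoids format guessing on millions of rows.
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format='%m/%d/%Y %H:%M:%S')
print(df['Date/Time'].dtype)
print(df['Date/Time'].iloc[0])
```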
In [17]:
In [18]:
Out[18]:
Date/Time    datetime64[ns]
Lat                 float64
Lon                 float64
Base                 object
weekday              object
day                   int64
minute                int64
month                 int64
hour                  int64
dtype: object
In [ ]:
In [19]:
Out[19]:
   Date/Time            Lat      Lon       Base    weekday  day  minute  month  hour
0  2014-09-01 00:01:00  40.2201  -74.0021  B02512  Monday   1    1       9      0
1  2014-09-01 00:01:00  40.7500  -74.0027  B02512  Monday   1    1       9      0
2  2014-09-01 00:03:00  40.7559  -73.9864  B02512  Monday   1    3       9      0
3  2014-09-01 00:06:00  40.7450  -73.9889  B02512  Monday   1    6       9      0
4  2014-09-01 00:11:00  40.8145  -73.9444  B02512  Monday   1    11      9      0
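The extra weekday, day, minute, month and hour columns can all be derived from Date/Time through the pandas .dt accessor. A sketch with two timestamps taken from this notebook's outputs; the notebook's own cell is not shown, so the exact calls are an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'Date/Time': pd.to_datetime(['2014-09-01 00:01:00', '2014-09-07 00:00:00']),
})

# Each attribute of the .dt accessor returns one value per row.
stamp = df['Date/Time'].dt
df['weekday'] = stamp.day_name()
df['day'] = stamp.day
df['minute'] = stamp.minute
df['month'] = stamp.month
df['hour'] = stamp.hour
print(df)
```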
In [ ]:
In [20]:
Out[20]:
array(['B02512', 'B02598', 'B02617', 'B02682', 'B02764'], dtype=object)
In [21]:
Out[21]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
      dtype=int64)
In [22]:
Out[22]:
array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'], dtype=object)

Analysis of Journeys by Weekday

In [23]:
In [24]:

Thursday appears to have the highest number of pickups.
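One way to check a weekday comparison like this without a plot is to count rows per weekday. A sketch on a toy weekday column (the real one has millions of rows); the names weekday_order and per_weekday are illustrative.

```python
import pandas as pd

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Toy stand-in for the real `weekday` column.
df = pd.DataFrame({'weekday': ['Thursday', 'Monday', 'Thursday', 'Sunday']})

# Count pickups per weekday, then put the days in calendar order
# (value_counts sorts by count, which would scramble a bar chart's x-axis).
per_weekday = df['weekday'].value_counts().reindex(weekday_order, fill_value=0)
print(per_weekday)
print(per_weekday.idxmax())
```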

In [ ]:

Analysis by Hour

In [17]:
Out[17]:
(array([216928., 103517., 227152., 543565., 324851., 366329., 819491.,
        660869., 579117., 692508.]),
 array([ 0. ,  2.3,  4.6,  6.9,  9.2, 11.5, 13.8, 16.1, 18.4, 20.7, 23. ]),
 <a list of 10 Patch objects>)
In [ ]:

Pickups peak in the evening, when people are leaving work.
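The histogram output in this section uses 10 bins with edges at multiples of 2.3, so some bars cover three whole hours (0, 1, 2) and others only two (3, 4), which makes bar heights hard to compare. A sketch that counts one bin per hour instead, using a toy array in place of the real hour column:

```python
import numpy as np

# Toy stand-in for df['hour']; the real column holds integers 0-23.
hours = np.array([0, 8, 17, 17, 18, 18, 18, 23])

# One count per hour of the day, so every bar covers exactly one hour.
per_hour = np.bincount(hours, minlength=24)
print(per_hour)

# Hour with the most pickups in this toy sample.
busiest_hour = int(per_hour.argmax())
print(busiest_hour)
```

For the plot itself, passing bins=range(25) to plt.hist should give the same one-bar-per-hour layout.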

In [18]:
9
5
6
7
8
4
In [19]:

Analysis of Rush by Hour in Each Month

In [20]:
In [ ]:

Analysis of Which Month Has the Most Rides

In [25]:
In [26]:

Analysis of Journey of Each Day

In [18]:
Out[18]:
Text(0.5, 1.0, 'Journeys by Month Day')

Analysis of Total Rides by Month

In [23]:
In [ ]:

Rush by Hour

In [12]:
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xeccdd1c348>
Adding the hue parameter
In [52]:
Out[52]:
Text(0.5, 1.0, 'hoursoffday vs latiitide of passenger')
In [35]:
Out[35]:
Date/Time Lat Lon Base weekday day minute month hour
0 2014-04-01 00:11:00 40.7690 -73.9549 B02512 Tuesday 1 11 4 0
1 2014-04-01 00:17:00 40.7267 -74.0345 B02512 Tuesday 1 17 4 0
2 2014-04-01 00:21:00 40.7316 -73.9873 B02512 Tuesday 1 21 4 0
3 2014-04-01 00:28:00 40.7588 -73.9776 B02512 Tuesday 1 28 4 0
4 2014-04-01 00:33:00 40.7594 -73.9722 B02512 Tuesday 1 33 4 0
In [36]:
Out[36]:
0    B02512
1    B02512
2    B02512
3    B02512
4    B02512
Name: Base, dtype: object
In [39]:
Out[39]:
Base    month
B02512  4         35536
        5         36765
        6         32509
        7         35021
        8         31472
        9         34370
B02598  4        183263
        5        260549
        6        242975
        7        245597
        8        220129
        9        240600
B02617  4        108001
        5        122734
        6        184460
        7        310160
        8        355803
        9        377695
B02682  4        227808
        5        222883
        6        194926
        7        196754
        8        173280
        9        197138
B02764  4          9908
        5          9504
        6          8974
        7          8589
        8         48591
        9        178333
Name: Date/Time, dtype: int64
In [42]:
Out[42]:
Base month Date/Time
0 B02512 4 35536
1 B02512 5 36765
2 B02512 6 32509
3 B02512 7 35021
4 B02512 8 31472
5 B02512 9 34370
6 B02598 4 183263
7 B02598 5 260549
8 B02598 6 242975
9 B02598 7 245597
10 B02598 8 220129
11 B02598 9 240600
12 B02617 4 108001
13 B02617 5 122734
14 B02617 6 184460
15 B02617 7 310160
16 B02617 8 355803
17 B02617 9 377695
18 B02682 4 227808
19 B02682 5 222883
20 B02682 6 194926
21 B02682 7 196754
22 B02682 8 173280
23 B02682 9 197138
24 B02764 4 9908
25 B02764 5 9504
26 B02764 6 8974
27 B02764 7 8589
28 B02764 8 48591
29 B02764 9 178333
In [45]:
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0xec9e043f48>
In [ ]:

2 Cross Analysis

Through our exploration we are going to visualize:

1. Heatmap by Hour and Weekday
2. Heatmap by Hour and Day
3. Heatmap by Month and Day
4. Heatmap by Month and Weekday

Heatmap by Hour and Weekday.

Create pivot tables

The simplest way to create a pivot table is to call groupby on the two columns first, so that we get one group per (weekday, hour) pair:
df.groupby(['weekday','hour']).apply(lambda x: len(x))
The result is indexed by both "weekday" and "hour". Calling unstack on it then keeps "weekday" as the rows and turns "hour" into the columns.
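The two steps can be sketched on a four-row toy frame. Here .size() stands in for the apply(lambda x: len(x)) call, since both count the rows in each group; fill_value=0 is my addition so that missing combinations show as 0 rather than NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'weekday': ['Friday', 'Friday', 'Friday', 'Monday'],
    'hour':    [0, 0, 1, 1],
})

# Step 1: one count per (weekday, hour) pair, indexed by both keys.
counts = df.groupby(['weekday', 'hour']).size()

# Step 2: move the inner index level (hour) into the columns.
pivot = counts.unstack(fill_value=0)
print(pivot)
```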
In [9]:
In [10]:
Out[10]:
weekday    hour
Friday     0       13716
           1        8163
           2        5350
           3        6930
           4        8806
                   ...  
Wednesday  19      47017
           20      47772
           21      44553
           22      32868
           23      18146
Length: 168, dtype: int64
In [11]:
Out[11]:
hour 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
weekday
Friday 13716 8163 5350 6930 8806 13450 23412 32061 31509 25230 ... 36206 43673 48169 51961 54762 49595 43542 48323 49409 41260
Monday 6436 3737 2938 6232 9640 15032 23746 31159 29265 22197 ... 28157 32744 38770 42023 37000 34159 32849 28925 20158 11811
Saturday 27633 19189 12710 9542 6846 7084 8579 11014 14411 17669 ... 31418 38769 43512 42844 45883 41098 38714 43826 47951 43174
Sunday 32877 23015 15436 10597 6374 6169 6596 8728 12128 16401 ... 28151 31112 33038 31521 28291 25948 25076 23967 19566 12166
Thursday 9293 5290 3719 5637 8505 14169 27065 37038 35431 27812 ... 36699 44442 50560 56704 55825 51907 51990 51953 44194 27764
Tuesday 6237 3509 2571 4494 7548 14241 26872 36599 33934 25023 ... 34846 41338 48667 55500 50186 44789 44661 39913 27712 14869
Wednesday 7644 4324 3141 4855 7511 13794 26943 36495 33826 25635 ... 35148 43388 50684 55637 52732 47017 47772 44553 32868 18146

7 rows × 24 columns

Create a heatmap so that the pattern is easy to see
In [12]:
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xf1244adc48>
In [13]:
Out[13]:
Date/Time Lat Lon Base weekday day minute month hour
0 2014-04-01 00:11:00 40.7690 -73.9549 B02512 Tuesday 1 11 4 0
1 2014-04-01 00:17:00 40.7267 -74.0345 B02512 Tuesday 1 17 4 0
2 2014-04-01 00:21:00 40.7316 -73.9873 B02512 Tuesday 1 21 4 0
3 2014-04-01 00:28:00 40.7588 -73.9776 B02512 Tuesday 1 28 4 0
4 2014-04-01 00:33:00 40.7594 -73.9722 B02512 Tuesday 1 33 4 0
In [20]:
In [21]:
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0xf124bcb108>
In [24]:
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0xf124f98a08>

Analysing the results

We observe that the number of trips increases every month: summing the per-base counts above, pickups grew from 564,516 in April to 1,028,136 in September 2014. This suggests that demand for Uber was growing steadily over the period.
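The monthly trend can be checked directly from the Base/month counts printed earlier (Out[39]). The figures below are copied from that table; only the variable names are mine.

```python
# Pickups per base for months 4-9, copied from the Base/month table (Out[39]).
counts = {
    'B02512': [35536, 36765, 32509, 35021, 31472, 34370],
    'B02598': [183263, 260549, 242975, 245597, 220129, 240600],
    'B02617': [108001, 122734, 184460, 310160, 355803, 377695],
    'B02682': [227808, 222883, 194926, 196754, 173280, 197138],
    'B02764': [9908, 9504, 8974, 8589, 48591, 178333],
}
months = [4, 5, 6, 7, 8, 9]

# Sum across bases to get one total per month.
monthly_total = {m: sum(per_base[i] for per_base in counts.values())
                 for i, m in enumerate(months)}
for m, total in monthly_total.items():
    print(m, total)

# True when every month has more pickups than the one before it.
totals = list(monthly_total.values())
increasing = all(a < b for a, b in zip(totals, totals[1:]))
print(increasing)
```

The six totals add up to 4,534,327, the row count of the combined 2014 frame.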

In [26]:
Out[26]:
Date/Time Lat Lon Base weekday day minute month hour
0 2014-04-01 00:11:00 40.7690 -73.9549 B02512 Tuesday 1 11 4 0
1 2014-04-01 00:17:00 40.7267 -74.0345 B02512 Tuesday 1 17 4 0
2 2014-04-01 00:21:00 40.7316 -73.9873 B02512 Tuesday 1 21 4 0
3 2014-04-01 00:28:00 40.7588 -73.9776 B02512 Tuesday 1 28 4 0
4 2014-04-01 00:33:00 40.7594 -73.9722 B02512 Tuesday 1 33 4 0
... ... ... ... ... ... ... ... ... ...
564511 2014-04-30 23:22:00 40.7640 -73.9744 B02764 Wednesday 30 22 4 23
564512 2014-04-30 23:26:00 40.7629 -73.9672 B02764 Wednesday 30 26 4 23
564513 2014-04-30 23:31:00 40.7443 -73.9889 B02764 Wednesday 30 31 4 23
564514 2014-04-30 23:32:00 40.6756 -73.9405 B02764 Wednesday 30 32 4 23
564515 2014-04-30 23:48:00 40.6880 -73.9608 B02764 Wednesday 30 48 4 23

564516 rows × 9 columns

In [ ]:
In [25]:
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0xf124c189c8>
In [ ]:

Analysis of Location data points

In [21]:
Out[21]:
(40.6, 41)
We can see a number of hot spots here. Midtown Manhattan is clearly the brightest spot, with dense pickups running from Midtown down to Lower Manhattan, followed by Upper Manhattan and Brooklyn Heights.
In [ ]:

Spatial analysis: a heatmap to get a clear picture of the rush on Sunday (weekend)

In [27]:
Out[27]:
Date/Time Lat Lon Base weekday day minute month hour
0 2014-09-01 00:01:00 40.2201 -74.0021 B02512 Monday 1 1 9 0
1 2014-09-01 00:01:00 40.7500 -74.0027 B02512 Monday 1 1 9 0
2 2014-09-01 00:03:00 40.7559 -73.9864 B02512 Monday 1 3 9 0
3 2014-09-01 00:06:00 40.7450 -73.9889 B02512 Monday 1 6 9 0
4 2014-09-01 00:11:00 40.8145 -73.9444 B02512 Monday 1 11 9 0
In [28]:
Out[28]:
Date/Time Lat Lon Base weekday day minute month hour
8011 2014-09-07 00:00:00 40.7341 -74.0005 B02512 Sunday 7 0 9 0
8012 2014-09-07 00:00:00 40.7344 -73.9900 B02512 Sunday 7 0 9 0
8013 2014-09-07 00:00:00 40.7806 -73.9582 B02512 Sunday 7 0 9 0
8014 2014-09-07 00:01:00 40.7293 -73.9859 B02512 Sunday 7 1 9 0
8015 2014-09-07 00:01:00 40.7713 -74.0133 B02512 Sunday 7 1 9 0
In [29]:
Out[29]:
Lat Lon weekday
0 39.9374 -74.0722 1
1 39.9378 -74.0721 1
2 39.9384 -74.0742 1
3 39.9385 -74.0734 1
4 39.9415 -74.0736 1
... ... ... ...
209225 41.3141 -74.1249 1
209226 41.3180 -74.1298 1
209227 41.3195 -73.6905 1
209228 41.3197 -73.6903 1
209229 42.1166 -72.0666 1

209230 rows × 3 columns

In [31]:
In [32]:
In [33]:
Out[33]:
In [ ]:
Let's create a function that does the same for any specific day
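A sketch of what such a function might look like, covering only the data step: filter to one weekday, then count pickups per coordinate pair. The function name and the pickups column are hypothetical (in the Sunday table above the count column is still labelled weekday), and the map-drawing step is left out because the notebook's plotting code is not shown.

```python
import pandas as pd


def pickups_by_location(df, day):
    """Count pickups per (Lat, Lon) for one weekday name, e.g. 'Sunday'."""
    one_day = df[df['weekday'] == day]
    return (one_day.groupby(['Lat', 'Lon'])
                   .size()
                   .reset_index(name='pickups'))


# Toy rows in the same layout as the prepared 2014 frame.
df = pd.DataFrame({
    'Lat': [40.7341, 40.7341, 40.7806, 40.7500],
    'Lon': [-74.0005, -74.0005, -73.9582, -74.0027],
    'weekday': ['Sunday', 'Sunday', 'Sunday', 'Monday'],
})
sunday = pickups_by_location(df, 'Sunday')
print(sunday)
```

Each output row is a (Lat, Lon, pickups) triple, which is the shape a weighted heatmap layer typically expects.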
In [34]:
In [35]:
Out[35]:

Analysis of January-June 2015 data (uber_15)

In [36]:
Out[36]:
Dispatching_base_num Pickup_date Affiliated_base_num locationID
0 B02617 2015-05-17 09:47:00 B02617 141
1 B02617 2015-05-17 09:47:00 B02617 65
2 B02617 2015-05-17 09:47:00 B02617 100
3 B02617 2015-05-17 09:47:00 B02774 80
4 B02617 2015-05-17 09:47:00 B02617 90
In [98]:
Out[98]:
(14270479, 4)
In [ ]:
In [99]:
Out[99]:
'2015-01-01 00:00:05'
In [100]:
Out[100]:
'2015-06-30 23:59:00'
In [37]:
In [38]:
In [39]:
Out[39]:
Dispatching_base_num Pickup_date Affiliated_base_num locationID weekday day minute month hour
0 B02617 2015-05-17 09:47:00 B02617 141 Sunday 17 47 5 9
1 B02617 2015-05-17 09:47:00 B02617 65 Sunday 17 47 5 9
2 B02617 2015-05-17 09:47:00 B02617 100 Sunday 17 47 5 9
3 B02617 2015-05-17 09:47:00 B02774 80 Sunday 17 47 5 9
4 B02617 2015-05-17 09:47:00 B02617 90 Sunday 17 47 5 9
Uber pickups by the month in NYC
In [40]:

We can see that the number of Uber pickups in NYC increased steadily throughout the first half of 2015.

In [ ]:

Analysing Rush in New York City

In [33]:
Interestingly, after the morning rush, the number of Uber pickups doesn't dip much throughout the rest of the morning and early afternoon. There is significantly more demand in the evening than the daytime. Let's investigate to see if there's a difference in hourly pattern for different days of the week.
In [ ]:

In-Depth Analysis of Rush in New York City by Day and Hour

Group the data by weekday and hour
In [114]:
Out[114]:
weekday    hour
Friday     0        85939
           1        46616
           2        28102
           3        19518
           4        23575
                    ...  
Wednesday  19      143751
           20      136003
           21      133993
           22      127026
           23       99490
Name: Pickup_date, Length: 168, dtype: int64
In [115]:
Out[115]:
weekday hour Pickup_date
0 Friday 0 85939
1 Friday 1 46616
2 Friday 2 28102
3 Friday 3 19518
4 Friday 4 23575
... ... ... ...
163 Wednesday 19 143751
164 Wednesday 20 136003
165 Wednesday 21 133993
166 Wednesday 22 127026
167 Wednesday 23 99490

168 rows × 3 columns

In [116]:
In [117]:
Out[117]:
weekday hour Counts
0 Friday 0 85939
1 Friday 1 46616
2 Friday 2 28102
3 Friday 3 19518
4 Friday 4 23575
... ... ... ...
163 Wednesday 19 143751
164 Wednesday 20 136003
165 Wednesday 21 133993
166 Wednesday 22 127026
167 Wednesday 23 99490

168 rows × 3 columns
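The three tables above (grouped Series, flat frame, renamed Counts column) follow from a groupby-count, a reset_index and a rename. A sketch on four toy pickups, assuming Pickup_date has already been parsed to datetime:

```python
import pandas as pd

uber_15 = pd.DataFrame({
    'Pickup_date': pd.to_datetime(['2015-05-15 00:10:00', '2015-05-15 00:40:00',
                                   '2015-05-15 01:05:00', '2015-05-17 09:47:00']),
})
uber_15['weekday'] = uber_15['Pickup_date'].dt.day_name()
uber_15['hour'] = uber_15['Pickup_date'].dt.hour

# Count rows per (weekday, hour), flatten the index, and name the count column.
summary = (uber_15.groupby(['weekday', 'hour'])['Pickup_date']
                  .count()
                  .reset_index()
                  .rename(columns={'Pickup_date': 'Counts'}))
print(summary)
```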

In [51]:
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0xe72af5f108>
Loading Uber-Jan-Feb-FOIL.csv
In [118]:
In [119]:
Out[119]:
dispatching_base_number date active_vehicles trips
0 B02512 1/1/2015 190 1132
1 B02765 1/1/2015 225 1765
2 B02764 1/1/2015 3427 29421
3 B02682 1/1/2015 945 7679
4 B02617 1/1/2015 1228 9537
In [120]:
Out[120]:
array(['B02512', 'B02765', 'B02764', 'B02682', 'B02617', 'B02598'],
      dtype=object)
In [121]:
Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x4ff12912c8>

Base B02764 appears to have the most active vehicles.

In [ ]:
In [122]:
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x507b6f7508>

Base B02764 also appears to have the most trips.

In [ ]:
In [123]:
In [124]:
Out[124]:
dispatching_base_number date active_vehicles trips trips/vehicle
0 B02512 1/1/2015 190 1132 5.957895
1 B02765 1/1/2015 225 1765 7.844444
2 B02764 1/1/2015 3427 29421 8.585060
3 B02682 1/1/2015 945 7679 8.125926
4 B02617 1/1/2015 1228 9537 7.766287
In [125]:
Out[125]:
dispatching_base_number active_vehicles trips trips/vehicle
date
1/1/2015 B02512 190 1132 5.957895
1/1/2015 B02765 225 1765 7.844444
1/1/2015 B02764 3427 29421 8.585060
1/1/2015 B02682 945 7679 8.125926
1/1/2015 B02617 1228 9537 7.766287
... ... ... ... ...
2/28/2015 B02764 3952 39812 10.073887
2/28/2015 B02617 1372 14022 10.220117
2/28/2015 B02682 1386 14472 10.441558
2/28/2015 B02512 230 1803 7.839130
2/28/2015 B02765 747 7753 10.378849

354 rows × 4 columns

How average trips per vehicle rises and falls over time for each base number
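The trips/vehicle column is trips divided by active_vehicles, and a date-by-base table makes the per-base trend easy to plot or read. A sketch using four rows copied from the outputs above; parsing date to datetime is my addition so that rows sort chronologically rather than as strings.

```python
import pandas as pd

# Four rows copied from the FOIL table outputs.
foil = pd.DataFrame({
    'dispatching_base_number': ['B02512', 'B02764', 'B02512', 'B02764'],
    'date': ['1/1/2015', '1/1/2015', '2/28/2015', '2/28/2015'],
    'active_vehicles': [190, 3427, 230, 3952],
    'trips': [1132, 29421, 1803, 39812],
})

# Average number of trips each active vehicle made that day.
foil['trips/vehicle'] = foil['trips'] / foil['active_vehicles']
foil['date'] = pd.to_datetime(foil['date'], format='%m/%d/%Y')

# One row per date, one column per base.
per_base = foil.pivot(index='date', columns='dispatching_base_number',
                      values='trips/vehicle')
print(per_base.round(3))
```

Calling per_base.plot() on the result would draw one line per base, which is one way to reproduce a chart like the one that follows.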
In [53]:
Out[53]:
<matplotlib.legend.Legend at 0xecd1643bc8>